Improved generation of prosodic features in HMM-based Mandarin speech synthesis
نویسندگان
چکیده
The HMM-based Text-to-Speech System can produce high quality synthetic speech with flexible modeling of spectral and prosodic parameters. However, the prosodic features, like F0 and duration trajectories, generated by HMM-based speech synthesis are often excessively smoothed and lack prosodic variance. In HMM-based TTS durations are typically modeled statistically using state duration probability distributions and duration prediction for unseen contexts without high-level linguistic knowledge. And F0 trajectory is generated by the MSD-HMMs as a weighted bias term. In this approach, discrete distributions are used for modeling the VU decision and continuous Gaussian distributions are used for F0 modeling within the voiced regions. Due to this assumption of undefined F0 values in unvoiced regions and the special structure of MSD-HMM, the generated F0 values are limited in accuracy. In this paper, in order to improve the prosodic features generation against the standard HMM framework, an F0 generation process model is used to re-estimate F0 values in the regions of pitch tracking errors, as well as in unvoiced regions. A prior knowledge of VU is imposed in each Mandarin phoneme and they are used for VU decision. Also we design a set of syntax features to improve Mandarin phoneme duration prediction.
منابع مشابه
Hierarchical stress modeling and generation in mandarin for expressive Text-to-Speech
Expressive speech synthesis has received increased attention in recent times. Stress (or pitch accent) is the perceptual prominence within words or utterances, which contributes to the expressivity of speech. This paper summarizes our contribution to Mandarin expressive speech synthesis. A novel hierarchical stress modeling and generation method for Mandarin is proposed and further integrated i...
متن کاملUsing FO Contour Generation Process Model for Improved and Flexible Control of Prosodie Features in HMM-based Speech Synthesis
Generation process model of fundamental frequency contours known as Fujisaki's model is ideal to represent global features of prosody. It is a command response model, where the commands have clear relations with linguistic and para/non linguistic information included in the utterance. Therefore, by controlling fundamental frequency contours in the framework of the generation process model, a mo...
متن کاملConcept-to-speech generation by integrating syntagmatic features into HMM-based speech synthesis
In conventional concept-to-speech (CTS) methods, a common step is predicting abstract prosodic descriptions, such as the locations of accents and phrase boundaries, from the linguistic information provided by the text generation module. But the prediction results always contain errors, and unacceptable prosodic prediction may ruin the synthesized speech. In addition, linguistic information, whi...
متن کاملGeneration of Fundamental Frequency Contours of Mandarin in HMM-based Speech Synthesis using Generation Process Model
The HMM-based speech synthesis system can produce high quality synthetic speech with flexible modeling of spectral and prosodic parameters. In this approach, short term spectra, fundamental frequency (F0) and duration are generated by multi-stream HMMs separately. However the quality of synthetic speech degrades when feature vectors used in training are noisy. Among all noisy features, pitch tr...
متن کاملUnsupervised prosody labeling for constructing Mandarin TTS
This paper introduces an unsupervised prosody labeling method for preparing a large speech corpus used in developing a Mandarin Text-to-Speech system. Adopting a four-layer prosody hierarchy, the proposed method performs an unsupervised segmental clustering that iteratively segments spoken utterances into strings of prosodic constituents and models the patterns of the segmented prosodic constit...
متن کامل